Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite PSBaseParser and add an optimized in-memory version #1041

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

dhdaines
Copy link
Contributor

@dhdaines dhdaines commented Sep 19, 2024

The buffering in PSBaseParser (and associated code contortions like the mess that was PDFContentParser) was fragile but also unnecesary when doing file I/O - cPython's BufferedReader implementation is considerably faster, so we can just reimplement the "parser" (really a lexer) as a character-based state machine.

But this actually would lead to an overall slowdown, because in reality, most of the time we aren't parsing PDF data from a buffered file, but from a BytesIO wrapped around an in-memory buffer. In this case, the buffering is redundant but nonetheless faster in practice since it avoids the overhead of calling BytesIO.read repeatedly.

The obvious solution is to create a separate "parser" (really a lexer) using good old regular expressions the way our ancestors intended, and simply use this one when passed an in-memory buffer.

Also means that there is a bit less inheritance abuse in the code, as PSStackParser needs to delegate to the appropriate implementation.

Fixes: #885 and #1025

Also there were some details of the PDF parsing that were incorrect. Most notably, hex strings with odd length are supposed to be padded in big-endian fashion (i.e. <abcde> is equivalent to <abcde0>) but this was not the case in the existing code (which treated this as <abcd0e> instead).

Tested on the usual test suite with nox, profiled with cProfile and time.time.

Checklist

  • I have read CONTRIBUTING.md.
  • I have added a concise human-readable description of the change to CHANGELOG.md.
  • I have tested that this fix is effective or that this feature works.
  • I have added docstrings to newly created methods and classes.
  • I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant